Investigated were the effects of two levels of penalty for incorrect
responses on two dependent variables (a measure of risk-taking or
confidence, based on nonsense items, and the number of response-attempts
to legitimate items) for three treatment groups in a 2x3, multi-response
repeated measures, multivariate ANOVA (Analysis of Variance) design.
Subjects responded under one of three scoring-administrative rules:
conventional Coombs-type directions and two variants suggested as
mathematically more adequate. Results indicated significant differences
both among groups and across conditions. The results were discussed with
reference to the question of test validity in general, and the problems
posed for criterion-referenced measurement. (Author)






Behavior on Objective Tests Under
Theoretically Adequate, Inadequate and Unspecified Scoring Rules



Stanley S. Jacobs 
University of Pittsburgh 



Paper presented at the annual meeting of the
American Educational Research Association
Chicago, Illinois



April 1974 






Abstract 



Investigated were the effects of two levels of penalty
for incorrect responses on two dependent variables (a measure
of risk-taking or confidence, based on nonsense items,
and the number of response-attempts to legitimate items)
for three treatment groups in a 2 x 3, multi-response
repeated measures, multivariate ANOVA design. Ss responded
under one of three scoring-administrative rules: conventional
Coombs-type directions and two variants suggested
as mathematically more adequate. Results indicated significant
differences both among groups and across conditions.
The results were discussed with reference to the question
of test validity in general, and the problems posed for
criterion-referenced measurement.

A number of alternative administrative and scoring procedures for
objective tests have been suggested (e.g. Coombs, 1953; de Finetti, 1969;
Ebel, 1969; Rippey, 1968) which have as their common objective a more
adequate assessment of the degree of partial knowledge held by a given
student with reference to a given item.¹

¹See Echternacht (1972) for a comprehensive description and review of a
number of alternative testing procedures.

A procedure known as "option-elimination" or "Coombs-type directions"
(CTD) seems quite applicable to the typical classroom testing situation.
With CTD, the student is required to identify as many of the J-1
distractors among the J item options as he or she is able. With the usual
scoring rule, a student earns one point for each distractor so identified.
A penalty of -(J-1) points is suffered if the correct answer is identified
as a distractor. Item scores, then, can range from -(J-1) points to +(J-1)
points, having 2(J-1)+1 possible values, rather than simply the 1 or 0
earned under conventional conditions. Apparently, CTD have the potential
for discerning intermediate levels of knowledge.
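The usual CTD scoring rule described above can be sketched in a few lines. This is only an illustration: the function name `ctd_item_score` and its arguments are hypothetical, and it assumes that credit for correctly eliminated distractors is still counted when the keyed answer is also eliminated (the text does not say otherwise). Enumerating every possible response to a four-option item shows the 2(J-1)+1 = 7 distinct item scores.

```python
from itertools import combinations


def ctd_item_score(marked, correct, n_options):
    """Score one item under the usual CTD rule (illustrative sketch).

    marked    -- set of option indices the student eliminated as distractors
    correct   -- index of the keyed answer
    n_options -- J, the number of options on the item
    """
    # One point for each true distractor the student eliminated.
    credit = sum(1 for option in marked if option != correct)
    # A penalty of (J - 1) points if the keyed answer was eliminated.
    penalty = (n_options - 1) if correct in marked else 0
    return credit - penalty


# Enumerate every subset of options a student could eliminate on a
# four-option item whose keyed answer is option 0.
possible = sorted(
    {
        ctd_item_score(set(subset), 0, 4)
        for size in range(5)
        for subset in combinations(range(4), size)
    }
)
print(possible)
```

With J = 4 the scores run from -3 to +3 in unit steps, which is the range stated in the text.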

Hritz and Jacobs (1970) demonstrated, however, that the problems
associated with the correction-for-guessing, documented by Votaw (1936)
and Sherriffs and Boomer (1954), have apparently been simply shifted from
the item level to the option level under CTD. Using CTD, all Ss behaved
conservatively, identifying too few distractors. There seem to be reliable
and extreme individual differences in the tendency to respond to items
under an announced guessing penalty and, similarly, to identify
distractors under a procedure which effectively incorporates such a
penalty. Furthermore, these differential response tendencies seem to be
moderated by personality variables unrelated to the variable measured by
the test under scrutiny (Slakter, 1968; Jacobs, 1971).

It has been suggested that the differential tendency to "take risks" in
the identification of some number of the J-1 options may be controlled by
increasing the level of penalty imposed for incorrect responses (Arnold
and Arnold, 1970, p. 13). There is, to the author's knowledge, no direct
empirical evidence for this suggestion where CTD are concerned. However,
if one extends the "argument by analogy" from the similar results obtained
in the Sherriffs and Boomer (1954) and Hritz and Jacobs (1970) studies,
the results of Waters (1967) may be relevant. Waters found that increased
levels of penalty resulted in significant increases in the number of
omitted items in a conventional multiple-choice testing situation. One
might hypothesize that increasing the level of penalty under CTD would
result in analogous behavior; i.e., Ss will identify fewer of the J-1
distractors.

Arnold and Arnold (1970) have also suggested that the usual
credit-penalty arrangement used with CTD, described above, is
mathematically inadequate. They then present, under a game-theoretic
model, what is proposed as a more adequate system. The credit-penalty
arrangement is such that the following are the "fair scores" assigned to
various responses to a four-option multiple-choice item.

TABLE 1

Fair Scores for a Four-Option Multiple-Choice Item, as Developed by
Arnold and Arnold (1970), and Used in the "A & A, Specified"
Condition in the Present Study

Outcome                                          Fair Score
-----------------------------------------------------------
Including the correct answer in the set of
  options identified as distractors              [not legible]
No distractors identified                        0
One distractor correctly identified              1/3
Two distractors correctly identified             1
Three distractors correctly identified           3



Unfortunately, some of the derivations involve the introduction of the
assumption of a random-guessing model, similar to that which is involved
in the derivation of the usual guessing correction formulae. It seems
inconsistent to develop a model for behavior under CTD (which
theoretically involves a rational partitioning of item options into two
sets, one of which S feels contains the correct answer) which assumes a
random process in responding. Also, the data offered by Arnold and Arnold
in support of their scoring procedure are suspect, since Ss in their study
were simply told that if they guessed their expected gain would be zero.
No actual information as to credit or penalty was provided to give Ss some
basis for decision making. There is evidence that even minor changes in
the directions provided Ss can produce significant changes in test scores,
on measures of cognitive variables (Yamamoto and Dizney, 1969) and on
measures of personality (Jacobs, 1972), obtained under more conventional
conditions.
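The zero-expected-gain property mentioned above can be made concrete with the credits in Table 1. The sketch below is an inference, not a figure taken from Arnold and Arnold (1970): it assumes a student who eliminates k of the four options purely at random, and solves for the loss that would make the expected gain zero for each k. The function name `zero_gain_penalty` is hypothetical.

```python
from fractions import Fraction
from math import comb

# Credits from Table 1 for a four-option item:
# number of distractors correctly identified -> fair score.
CREDITS = {1: Fraction(1, 3), 2: Fraction(1), 3: Fraction(3)}


def zero_gain_penalty(n_options, n_marked, credit):
    """Loss that gives zero expected gain under purely random elimination.

    With n_marked options eliminated at random, the chance that the keyed
    answer is NOT among them is C(J-1, k) / C(J, k).
    """
    p_safe = Fraction(comb(n_options - 1, n_marked), comb(n_options, n_marked))
    # Solve p_safe * credit - (1 - p_safe) * loss = 0 for loss.
    return p_safe * credit / (1 - p_safe)


penalties = {k: zero_gain_penalty(4, k, c) for k, c in CREDITS.items()}
for k, loss in penalties.items():
    print(f"{k} option(s) eliminated at random: break-even loss = {loss}")
```

Under these assumptions the break-even loss is the same (one point) whichever number of options is eliminated, which is what a single fixed penalty for eliminating the keyed answer would require.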

The purpose of the present study was to examine the effects of
increasing penalty level, under three types of CTD instructions, on
performance on a multiple-choice vocabulary test. The three credit-penalty
arrangements were: (1) the conventional approach described above and (2)
two variants of the Arnold and Arnold approach; (a) one with all weights
specified and (b) one without, with Ss simply informed that guessing would
result in a zero expected gain (or, in the case of increased penalty, an
expected loss to S). Under 1 and 2a, announced penalties were doubled
under the increased penalty condition.
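The shift from a zero expected gain to an expected loss when the penalty is doubled can be illustrated for the conventional CTD rule. The sketch assumes a four-option item, purely random elimination of k options, and that "doubled" means the -(J-1) penalty becomes -2(J-1) while credits stay at one point per distractor; the function name and the `penalty_multiplier` argument are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def expected_ctd_score(n_options, n_marked, penalty_multiplier=1):
    """Expected item score when n_marked options are eliminated at random."""
    correct = 0  # which option is keyed does not matter under random marking
    penalty = (n_options - 1) * penalty_multiplier
    subsets = list(combinations(range(n_options), n_marked))
    total = Fraction(0)
    for marked in subsets:
        # One point per true distractor eliminated.
        credit = sum(1 for option in marked if option != correct)
        # Penalty applies only if the keyed answer was eliminated.
        total += credit - (penalty if correct in marked else 0)
    return total / len(subsets)


usual = [expected_ctd_score(4, k, 1) for k in (1, 2, 3)]
doubled = [expected_ctd_score(4, k, 2) for k in (1, 2, 3)]
print("usual penalty:  ", usual)
print("doubled penalty:", doubled)
```

With the usual penalty the expected score is zero for every k; with the doubled penalty it is negative and grows more negative as more options are eliminated at random.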

Method 

Materials 

Two randomly parallel 50-item multiple-choice vocabulary tests, which
included 10 nonsense items for use with Slakter's (1967) measure of
risk-taking, were developed for the study. (See Table 2 for descriptive
data.)



TABLE 2

Descriptive Data for the Two Randomly Parallel
Vocabulary Tests Used in the Present Study

Form    n     k     Mean    s.d.    t*          F**          r
--------------------------------------------------------------
A       32    40    24.9    5.5     .99 (ns)    1.59 (ns)    .76
B       32    40    25.6    6.7

 *test on means
**test on variances
The data presented in Table 2 are based on the 40 legitimate items only,
and were collected on a group of 32 Ss similar to those in the present
study. The tests were administered in a constant A-B order for all Ss,
under instructions which indicated Ss should respond to all items; scores
were based on the number correct. The t and F tests indicate that the
means and variances, respectively, are not significantly different
(p > .05). The between-form correlation of .76 indicates the two forms
produce data sufficiently equivalent to justify their use in a repeated
measures design.

Two dependent variables were used: a measure of confidence or
risk-taking, based on Slakter's (1967) formula and responses to the 10
nonsense items on each form, and a related index, the number of item
options responded to on the 40 legitimate items on each form. The
differences in scoring rules make a comparison of actual scores
meaningless if computed according to the rules presented.

Subjects

After development and tryout of the two vocabulary tests to be used in
the present study, subjects were recruited from the enrollment of two
sections of the introductory master's level research methods course in the
School of Education at the University of Pittsburgh. They were asked to
participate in a study to be conducted during an hour of class time, which
would serve as a vehicle for subsequent lectures and discussions
throughout the term. All students agreed to participate.

Procedure 

The 87 Ss were randomly assigned to the three treatment groups mentioned
above. Each S received instructions, several worked examples, and test
materials to enable individual work without need for explanation from E.
This allowed Ss in different treatments to remain in the same room. An
error in assignment resulted in one S being misassigned (note the n's in
Table 3). Thirty minutes were allowed for form A, whereupon all materials
were collected, and materials for form B distributed. Thirty minutes were
also allowed for form B. All materials were then collected, and Ss were
thoroughly debriefed concerning the study.

Design 

The design of the study was conceptualized as a 3x2 full rank
multi-response repeated measures design and was analyzed as such using
procedures discussed by Timm and Carlson (1973) and Timm (1974).

The design of the study is presented in Figure 1, to enable the reader
to relate the various hypothesis tests to the design.



              Dependent Variable 1 (V1)         Dependent Variable 2 (V2)
Treatments    Condition 1    Condition 2        Condition 1    Condition 2
                 (C1)           (C2)               (C1)           (C2)
---------------------------------------------------------------------------
T1               μ11            μ12                μ13            μ14
T2               μ21            μ22                μ23            μ24
T3               μ31            μ32                μ33            μ34

Fig. 1. Plan of the 3x2 Multi-response Repeated Measures
Design Used in the Present Study

Results

Results descriptive of the effects of treatments and conditions on the
two dependent variables employed are presented in Table 3 in a format
consistent with Figure 1, and illustrated in Figures 2 and 3. The
intercorrelations of the dependent variables are presented in Table 4.
The data presented in Table 3 indicate that the increased penalty had a
similar impact on the two dependent variables in the three experimental
groups; in all cases, increased penalty (either specified or implied)
resulted in a decrease in index averages. It may also be seen that the two
dependent variables are significantly and substantially correlated within
penalty conditions within treatment groups.



TABLE 3

Means and Standard Deviations of the Two Dependent Variables Used, Under the
Two Penalty Conditions for the Three Experimental Groups in the Present Study

                                      risk (V1)                     response attempts (V2)
                              low penalty    high penalty        low penalty    high penalty
                                 (C1)           (C2)                (C1)           (C2)
Group                     n   Mean   s.d.    Mean   s.d.         Mean   s.d.    Mean   s.d.
---------------------------------------------------------------------------------------------
CTD (T1)                  30  14.60  6.74     6.97  7.96         81.73  14.20   71.70  15.71
A & A, specified (T2)     28  17.68  8.49    10.18  8.30         85.82  17.56   76.39  20.72
A & A, unspecified (T3)   29  19.59  6.72    12.59  7.72         93.55  15.62   84.86  17.18
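The statement that increased penalty lowered both indices in every group can be checked directly from the Table 3 means. The short sketch below simply subtracts the high-penalty mean from the low-penalty mean for each group and variable; it assumes the partly garbled low-penalty risk mean for T2 reads 17.68, and the dictionary layout is an illustrative choice.

```python
# (low-penalty mean, high-penalty mean) from Table 3.
MEANS = {
    "CTD (T1)": {"V1": (14.60, 6.97), "V2": (81.73, 71.70)},
    "A & A, specified (T2)": {"V1": (17.68, 10.18), "V2": (85.82, 76.39)},
    "A & A, unspecified (T3)": {"V1": (19.59, 12.59), "V2": (93.55, 84.86)},
}

# Drop in each index mean when moving from low to high penalty.
drops = {
    group: {var: round(low - high, 2) for var, (low, high) in cells.items()}
    for group, cells in MEANS.items()
}

for group, cells in drops.items():
    print(f"{group}: V1 drop = {cells['V1']}, V2 drop = {cells['V2']}")
```

All six differences are positive and of comparable size within each variable (roughly 7 to 7.6 points for V1 and 8.7 to 10 points for V2), which is the descriptive pattern the text reports.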




Fig. 2. Plot of Data Points for Variable One, Risk-taking or Confidence,
Across Two Penalty Conditions (C1 and C2)
for the Three Treatment Groups

Fig. 3. Plot of Data Points for Variable Two, Number of Response Attempts,
Across the Penalty Conditions (C1 and C2) for the Three Treatment Groups

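A multivariate F-test was performed to test differences among treatments.

$$
H:\quad
\begin{pmatrix} \mu_{11} \\ \mu_{12} \\ \mu_{13} \\ \mu_{14} \end{pmatrix}
=
\begin{pmatrix} \mu_{21} \\ \mu_{22} \\ \mu_{23} \\ \mu_{24} \end{pmatrix}
=
\begin{pmatrix} \mu_{31} \\ \mu_{32} \\ \mu_{33} \\ \mu_{34} \end{pmatrix}
$$

In other words, the significance of differences among the T1, T2, and T3
mean vectors was tested. The results of this analysis are presented in
Table 5; the multivariate F of 1.795 has a chance probability of
approximately .08. (See Table 5)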
TABLE 4

Intercorrelations of the Two Dependent Variables Under the Two Penalty
Conditions for the Three Experimental Groups of the Present Study*

                                        Treatment Conditions and Variables
                                   risk (V1)                response attempts (V2)
Experimental                low penalty  high penalty     low penalty  high penalty
Groups                         (C1)         (C2)             (C1)         (C2)
------------------------------------------------------------------------------------
CTD (T1)            V1C1       1.00
(n = 30)            V1C2        .63         1.00
                    V2C1        .85          .29             1.00
                    V2C2        .49          .55              .52         1.00

A & A,              V1C1       1.00
specified (T2)      V1C2        .68         1.00
(n = 28)            V2C1        .88          .55             1.00
                    V2C2        .55          .79              .65         1.00

A & A,              V1C1       1.00
unspecified (T3)    V1C2        .45         1.00
(n = 29)            V2C1        .85          .22             1.00
                    V2C2        .45          .58              .54         1.00

*for n = 30, df = 28, r > .306, p < .05 and r > .463, p < .01
 for n = 29, df = 27, r > .311, p < .05 and r > .471, p < .01
 for n = 28, df = 26, r > .317, p < .05 and r > .479, p < .01
TABLE 5

Multivariate ANOVA Summary Table; "Treatments" Hypothesis

Source           df    SSP                                              Multivariate F   p-value
--------------------------------------------------------------------------------------------------
Treatments (T)     6   [  382.112                             (Sym)  ]       1.795         .0814
                       [  801.143   2115.842                         ]
                       [  392.378    964.627    469.501              ]
                       [  893.491   2351.827   1075.491    2614.493  ]

Error            162   [ 4311.842                             (Sym)  ]
                       [ 8185.214  21005.146                         ]
                       [ 2861.542   5839.247   5366.108              ]
                       [ 5438.981  13775.771   7836.081   27008.427  ]

This analysis tests for the coincidence of the four data points obtained
within each of the three treatments. The overall F is regarded as
indicating a significant departure from coincidence (p < .10).

Confidence intervals were calculated for differences between treatments'
data points. It was determined that significant differences existed for
V1C1, V1C2, V2C1, and V2C2 between T1 (CTD) and T3 (A & A, unspecified),
only (p < .10).

A multivariate F-test was then performed to test the significance of
differences across the two penalty conditions, C1 and C2, simultaneously
for both variables.



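With the cell means indexed as in Figure 1, the hypothesis tested was

$$
H:\quad
\begin{pmatrix} \mu_{11} & \mu_{13} \\ \mu_{21} & \mu_{23} \\ \mu_{31} & \mu_{33} \end{pmatrix}
=
\begin{pmatrix} \mu_{12} & \mu_{14} \\ \mu_{22} & \mu_{24} \\ \mu_{32} & \mu_{34} \end{pmatrix}
$$

The multivariate F of 14.702 has a chance probability of less than .001;
a summary of the analysis is presented in Table 6.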
TABLE 6

Multivariate ANOVA Summary Table; "Conditions" Hypothesis

Source            df    SSP                          Multivariate F   p-value
-------------------------------------------------------------------------------
Conditions (C)      6   [ 4884.134   6131.934 ]          14.702        < .001
                        [ 6131.934   7698.971 ]

Error             166   [ 3954.867   6743.067 ]
                        [ 6743.067  20462.031 ]


The calculation of confidence intervals for the obtained differences
across penalty conditions indicated that the increased level of penalty
had a significant effect (p < .05) on variable 1 (a measure of confidence
or risk-taking), but not variable 2 (the number of responses made to the
options of legitimate items), and the effect was similar for all three
treatment groups.

A multivariate F-test of the interaction (T x C) hypothesis produced a
nonsignificant F of 0.70 (p = .9911). In view of the nonsignificant
interaction, two additional analyses were performed.

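The first analysis compared differences among treatment groups by
contrasting vectors of V1, V2 averages simultaneously, collapsed across
conditions (cell means indexed as in Figure 1):

$$
H:\quad
\begin{pmatrix} \dfrac{\mu_{11}+\mu_{12}}{2} \\[6pt] \dfrac{\mu_{13}+\mu_{14}}{2} \end{pmatrix}
=
\begin{pmatrix} \dfrac{\mu_{21}+\mu_{22}}{2} \\[6pt] \dfrac{\mu_{23}+\mu_{24}}{2} \end{pmatrix}
=
\begin{pmatrix} \dfrac{\mu_{31}+\mu_{32}}{2} \\[6pt] \dfrac{\mu_{33}+\mu_{34}}{2} \end{pmatrix}
$$

The multivariate F-test produced an F of 2.941, which has a chance
probability of approximately .0221. The results of this analysis are
summarized in Table 7.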
TABLE 7

Multivariate ANOVA Summary Table; "Treatments*" Hypothesis

Source             df    SSP                          Multivariate F   p-value
--------------------------------------------------------------------------------
Treatments* (T*)     4   [  395.592    933.689 ]           2.941        .0221
                         [  933.689   2356.497 ]

Error              166   [ 3650.258   6324.880 ]
                         [ 6324.880  18891.279 ]



Confidence intervals calculated indicated the significant differences
were between T1 (CTD) and T3 (A & A, unspecified), only (p < .05) and only
for variable 1 (risk-taking, collapsed across conditions).

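The second analysis contrasted conditions C1 and C2 by contrasting
vectors of V1, V2 averages simultaneously over the three treatment groups
(cell means indexed as in Figure 1):

$$
H:\quad
\begin{pmatrix} \dfrac{\mu_{11}+\mu_{21}+\mu_{31}}{3} \\[6pt] \dfrac{\mu_{13}+\mu_{23}+\mu_{33}}{3} \end{pmatrix}
=
\begin{pmatrix} \dfrac{\mu_{12}+\mu_{22}+\mu_{32}}{3} \\[6pt] \dfrac{\mu_{14}+\mu_{24}+\mu_{34}}{3} \end{pmatrix}
$$

The multivariate F-test resulted in an F of 55.566, p < .0001. A summary
of this analysis is presented in Table 8.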

TABLE 8

Multivariate ANOVA Summary Table; "Conditions*" Hypothesis

Source             df    SSP                          Multivariate F   p-value
--------------------------------------------------------------------------------
Conditions* (C*)     2   [ 4860.935   6099.982 ]          55.566       < .0001
                         [ 6099.982   7654.860 ]

Error               83   [ 3954.867   6743.067 ]
                         [ 6743.067  20462.031 ]
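The multivariate F values reported for the two "conditions" analyses can be recovered from the printed SSP matrices through Wilks' lambda, |E| / |E + H|. The paper does not name the test criterion, so using Wilks' lambda here is an assumption, as are the error degrees of freedom of 84 (87 Ss less 3 groups) and the hypothesis degrees of freedom of 3 for Table 6 and 1 for Table 8 (inferred from the printed df). The function names are hypothetical; the F transformations are the standard exact ones for two dependent variables.

```python
import math


def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def wilks_lambda(h, e):
    """Wilks' lambda = |E| / |E + H| for 2 x 2 SSP matrices."""
    total = [[h[i][j] + e[i][j] for j in range(2)] for i in range(2)]
    return det2(e) / det2(total)


def f_two_variables(lam, df_hyp, df_err):
    """Exact F equivalent of Wilks' lambda with p = 2 dependent variables.

    df_hyp == 1: F on (2, df_err - 1) df
    df_hyp >= 2: F on (2 * df_hyp, 2 * (df_err - 1)) df
    """
    if df_hyp == 1:
        return (1 - lam) / lam * (df_err - 1) / 2
    root = math.sqrt(lam)
    return (1 - root) / root * (df_err - 1) / df_hyp


E = [[3954.867, 6743.067], [6743.067, 20462.031]]     # error, Tables 6 and 8
H_TABLE6 = [[4884.134, 6131.934], [6131.934, 7698.971]]
H_TABLE8 = [[4860.935, 6099.982], [6099.982, 7654.860]]

f_table6 = f_two_variables(wilks_lambda(H_TABLE6, E), df_hyp=3, df_err=84)
f_table8 = f_two_variables(wilks_lambda(H_TABLE8, E), df_hyp=1, df_err=84)
print(round(f_table6, 3), round(f_table8, 3))
```

Under these assumptions the two values agree with the tabled F of 14.702 (on 6 and 166 df) and 55.566 (on 2 and 83 df) to within rounding.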
When confidence intervals were calculated for obtained differences
across conditions C1 and C2 for variables (V1 and V2) collapsed across
treatments (T1, T2 and T3), it was found that differences for both
variables were significant. That is, when "condition effects" are obtained
by averaging across treatment groups, the effect is significant for both
dependent variables (p < .05).

Summary and Discussion

Although the descriptive data presented in Table 3 and in Figures 2
and 3 seem to indicate that increased penalty-level has a similar effect
on both dependent variables, several analyses indicate that V1, Slakter's
measure of risk-taking, is apparently the more sensitive and consistent
dependent variable.

The test of the first "treatments" hypothesis (see Table 5) indicated
significant differences between T1 and T3 for both dependent variables and
both conditions, but with p < .10. A second test of this hypothesis, based
on V1, V2 averages, collapsed across conditions, indicated that
significant differences existed between T1 and T3 (p < .05), but only for
V1 (see Table 7).

When both "conditions" hypotheses were tested (see Tables 6 and 8),
V1 reflected a significant decline from C1 to C2 in both analyses; V2 did
so only in the latter analysis.

Variable one, then, a measure of risk-taking (as Slakter termed it) or,
more descriptively, a measure of confidence exhibited in attempting to
answer what probably appear to Ss as very difficult items, seems to be
more affected by both the increase in penalty and the variations in test
directions.

Although it is usually not considered good practice to develop
procedures under one set of conditions, with the expectation they will
readily generalize to another set, the data of the present study indicate
that the differences in behavior between conditions where the details of a
scoring rule are presented, and where they are not, are not statistically
significant. However, with respect to CTD in their usual form, and to a
condition where penalties are implied but not specified, it appears that
students exhibit more confidence in attempting to answer nonsense items
(which, at least logically, may have a "stimulus-value" analogous to an
extraordinarily difficult legitimate item) under the latter conditions.
These results are consistent with those of Waters, who found that students
apparently view completely unspecified scoring weights as indicating zero
weights for incorrect answers.

The results of the present study indicate that the problem of
conservative responding under CTD noted by Hritz and Jacobs (i.e. all Ss
tended to identify too few distractors) may be partially resolved using
procedures paralleling those of T3. The question of the effect on test
validity and the possible interaction with subject (attribute) variables
needs investigation. Also, one is left wondering what the long term
effects of experience with the procedures in the present study would be,
e.g. after some experience, would Ss in the different treatment conditions
behave in a more similar fashion? Would an interaction between level of
penalty and subject characteristics develop?



The present study has implications for the general domain known as
criterion-referenced testing. While most of the effort in this area thus
far has centered around strategies for item and test development, item and
test administration and item and test analysis, a very fundamental
question remains; what should Ss be told when they confront a
criterion-referenced test? The results of the present study indicate that
behavior observed may vary as a function of item difficulty, instructions
to the S, and penalty for incorrect responses. Since behavior is compared
with some criterion or standard, one cannot assume that the effect (if
any) is constant across all Ss (which would permit legitimate
norm-referenced comparisons) and therefore of no importance. One may
instead consistently over- or under-estimate the level of performance of a
group of Ss, depending upon how the criterion information was generated.